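# Notebook configuration and display helpers
from traitlets.config.manager import BaseJSONConfigManager
from IPython.display import display, HTML
# Wildcard imports pull NumPy names (log2, ceil, ...) into the notebook namespace
from numpy import *
from numpy.random import *
from matplotlib.pyplot import figure, show, draw, tight_layout
import sympy
# Render matplotlib figures inline in the notebook
%matplotlib inline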

# Slideshow (RISE / livereveal) settings are stored in the nbconfig directory
path = '/home/datasci/.jupyter/nbconfig'
cm = BaseJSONConfigManager(config_dir=path)

cm.update('livereveal',
         {
             'theme': 'serif',
             'start_slideshow_at': 'selected',
             'width': 1280,
             'height': 960,
             'scroll': False,
             'progress': True,
             'controls': True,
             'slideNumber': True
         })

An Introduction to Data Science
using Docker and Jupyter





Author: Josh Cole
Previous Company: General Dynamics
Previous Position: Systems Engineer
University: The University of Bristol
Studied: MEng in Electronics and Communications
Current Role: Software Engineer/Data Science/Big Data

Overview

    • Why Docker?
    • Why Jupyter?
    • Can anyone get involved with Data Science?
    • What skills do you need?
    • Big Data: the four Vs
    • Learning Resources
    • A brief play with Docker and Jupyter

Why Docker?

  • Standard Dev Environment
    • Configuring a data science environment can be a pain
    • Dealing with inconsistent package versions
    • Having to dive through obscure error messages
    • Waiting hours for packages to compile can be frustrating
  • Moving to Docker
    • The problems above make it hard to get started with data science, and they are a completely arbitrary barrier to entry
    • With Docker, we can download an image that contains a ready-made set of packages and data science tools
    • Everyone runs the same image, so inconsistent package versions stop being a problem
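
As a rough sketch of what "downloading an image" looks like in practice, the snippet below assembles a `docker run` command for a Jupyter image. The image name `jupyter/datascience-notebook`, port 8888 and the mount path `/home/jovyan/work` are assumptions borrowed from the public Jupyter Docker Stacks, not from this talk's repository, and the helper `build_docker_run_command` is a hypothetical name used only for illustration.

```python
import shlex


def build_docker_run_command(image, host_port=8888, host_dir=None):
    """Return the argument list for starting a Jupyter container (illustrative helper)."""
    # --rm removes the container when it stops; -p publishes the notebook port.
    command = ["docker", "run", "--rm", "-p", f"{host_port}:8888"]
    if host_dir is not None:
        # Mount a host folder so notebooks survive after the container is removed.
        # /home/jovyan/work is assumed from the Jupyter Docker Stacks images.
        command += ["-v", f"{host_dir}:/home/jovyan/work"]
    command.append(image)
    return command


command = build_docker_run_command(
    "jupyter/datascience-notebook", host_port=8888, host_dir="/tmp/notebooks"
)
# Print the command as it would be typed in a shell.
print(shlex.join(command))
```

On a machine with Docker installed, the same list could be passed to `subprocess.run(command, check=True)`; Docker pulls the image on first use.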

Why Jupyter?

    • It excels at literate programming, a style of software development pioneered by Stanford computer scientist Donald Knuth
    • Lets users formulate and describe their thoughts in prose, supplemented by mathematical equations, before they write code blocks
    • Commonly used in:
      • Demonstrations
      • Research
      • Teaching
      • Collaborative exercises
    • Supports:
      • LaTeX equations using MathJax
      • Markdown cells
      • Interactive presentations
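
To make the mix of prose, equations and code concrete, here is a minimal sketch of a two-cell notebook built by hand. It assumes the version 4 notebook JSON layout (a list of cells plus metadata); the cell contents are invented for illustration.

```python
import json

# A notebook is a JSON document: a list of cells plus some metadata.
notebook = {
    "nbformat": 4,
    "nbformat_minor": 4,
    "metadata": {},
    "cells": [
        {
            # Prose and a LaTeX equation (rendered by MathJax) live in a Markdown cell.
            "cell_type": "markdown",
            "metadata": {},
            "source": [
                "## Sample mean\n",
                "The mean of $n$ values is $\\bar{x} = \\frac{1}{n}\\sum_{i=1}^{n} x_i$.",
            ],
        },
        {
            # The code that implements the idea follows in a code cell.
            "cell_type": "code",
            "execution_count": None,
            "metadata": {},
            "outputs": [],
            "source": [
                "values = [2, 4, 6]\n",
                "sum(values) / len(values)",
            ],
        },
    ],
}

# Serialise the notebook; saved with an .ipynb extension it can be opened in Jupyter.
notebook_json = json.dumps(notebook, indent=1)
print(len(notebook["cells"]), "cells")
```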

Can anyone get involved with Data Science?

  • Do you need a PhD in Statistics/Machine Learning?
    • Many data scientists acquired their quantitative and statistical modeling skills in college, but pursued degrees in business, economics and engineering
    • They actually know about business problems
  • Do data scientists get their hands dirty?
    • Data Scientists get their fingernails dirty dumping piles of data in analytical sandboxes, cleansing, and sifting through it for useful patterns that may or may not exist. Then, they do it all over again.

What skills do you need?

Data Science Profile from Doing Data Science by Cathy O'Neil and Rachel Schutt
There is no "I" in "Team": Don't go it alone

Big Data: the Four Vs

  • Volume: data at rest, i.e. the amount of data
  • Variety: data in many forms:
    • Different types of data (e.g. structured, semi-structured and unstructured data)
    • Different data sources (e.g. internal, external, public)
  • Velocity: data in motion, i.e. the speed at which data is generated and processed
  • Veracity: data in doubt, i.e. the varying levels of noise and processing errors
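
A small sketch of what "variety" means in code, using only the Python standard library. The sensor readings, JSON record and free-text note are made-up sample data, chosen to show one structured, one semi-structured and one unstructured input.

```python
import csv
import io
import json
from collections import Counter

# Structured: fixed columns, every row has the same fields.
csv_text = "sensor,reading\nA,1.5\nB,2.5\n"
rows = list(csv.DictReader(io.StringIO(csv_text)))
mean_reading = sum(float(row["reading"]) for row in rows) / len(rows)

# Semi-structured: self-describing, but fields may be nested or missing.
json_text = '{"sensor": "A", "tags": ["indoor"], "location": {"room": 12}}'
record = json.loads(json_text)
room = record.get("location", {}).get("room")

# Unstructured: free text with no schema; any structure has to be extracted.
note = "Sensor A failed twice. Sensor B was fine."
word_counts = Counter(note.lower().replace(".", "").split())

print(mean_reading, room, word_counts["sensor"])
```

Each form needs a different parsing step before the data can be analysed together, which is part of what makes variety hard.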

Learning Resources

  • Coursera: high quality courses on Data Science/Machine Learning/Statistics: https://www.coursera.org/
  • Udacity: similar to Coursera: https://www.udacity.com/
  • Kaggle: competitions, datasets, tutorials: https://www.kaggle.com/
  • KDnuggets: datasets, blogs and tutorials: https://www.kdnuggets.com/
  • Just Google it!

A brief play with Docker and Jupyter

  • GitHub: https://github.com/JoshCole/DataScience-Stack
  • A quick tour

Questions